Feature selection for high dimensional microarray gene expression data via weighted signal to noise ratio

Feature selection in high dimensional gene expression datasets reduces not only the dimension of the data but also the execution time and computational cost of the underlying classifier. The current study introduces a novel feature selection method, called weighted signal to noise ratio (WSNR), that exploits feature weights based on support vectors and the signal to noise ratio, with the objective of identifying the most informative genes in high dimensional classification problems. Combining these two state-of-the-art procedures enables the extraction of the most informative genes. The corresponding weights from the two procedures are multiplied and arranged in decreasing order. A larger weight indicates greater discriminatory power of a feature in classifying the tissue samples into their true classes. The method is validated on eight gene expression datasets, and its results are compared with four well known feature selection methods. WSNR outperforms the competing methods on 6 out of 8 datasets. Box-plots and bar-plots of the results of the proposed method and all the other methods are also constructed. The proposed method is further assessed on simulated data. The simulation analysis reveals that WSNR outperforms all the other methods included in the study when the data contain noisy variables and outlying observations.


Introduction
Feature (gene) selection in micro-array gene expression datasets has gained great attention during recent decades [1][2][3][4][5][6][7]. High dimensional datasets usually contain noisy, redundant and non-informative features that increase the computational complexity as well as the execution time of the underlying model. Feature selection is therefore necessary to select the informative features and remove the unnecessary ones. This not only reduces execution or training time but can also increase the accuracy of the model. Based on this model one can categorize the samples in the data into their classes [8]. Feature selection is mainly carried out by three kinds of methods: wrapper, filter and embedded. The feature selection methods used in this paper fall under the category of filter methods, except sigF [9], which is a wrapper method. Feature or variable selection is used in a variety of tasks such as classification, regression and clustering [10]. Different types of biological data sets can also be analyzed by using feature selection, for instance whole-genome sequencing data [11], protein mass spectra data [12] and whole-genome expression data [13][14][15]. Micro-array and other high throughput technologies are capable of measuring thousands of genes simultaneously, which has led to their widespread use in clinical settings. Recent years have witnessed many feature selection methods for micro-array data analysis. The authors in [16] introduced a 'double feature selection method' that uses both the global and the intrinsic geometric information for the selection of informative features in the data. Similarly, the study in [17] introduced a method that handles semi-supervised feature selection tasks by combining the neighborhood discriminant index (NDI) and forward iterative Laplacian score (FILS) methods for the selection of discriminative features in high-dimensional data sets.
The study in [18] proposed a more efficient implementation of linear support vector machines and an improved recursive feature elimination strategy, and combined the two to select informative genes. A study in [2] proposed a new technique that applies an ensemble of feature selection procedures to select genes that are highly correlated with Lung Adenocarcinoma (LUAD). Using LUAD RNA-seq data from The Cancer Genome Atlas (TCGA), mutual information (MI) was employed, followed by recursive feature elimination (RFE), along with an SVM classifier. A new Bi-dimensional Principal Feature Selection (BPFS) procedure for efficiently extracting critical genes from high dimensional gene expression datasets was proposed in [19]. This procedure applies principal component analysis (PCA) to the sample and gene domains successively, in order to identify the informative genes and reduce redundancies while losing less information. Further work on the selection of informative features and their importance in classification and regression can be found in [20][21][22][23][24][25][26][27]. The main focus of these methods is to enhance the classification accuracy of the underlying classifier with the help of the selected genes, while ignoring their biological relevance, which leads to inaccurate downstream data analysis [28][29][30][31][32][33]. Therefore, it is necessary to devise a feature selection method that not only increases the classification accuracy but is also capable of identifying the biological significance of the selected genes in the tumor versus normal contrast [34,35]. This paper proposes a new feature selection procedure that combines the information obtained from a well known feature selection method, the signal to noise ratio (SNR) [40], with the feature weights given by a support vector machine (SVM) [36].
To assess the performance of the proposed method, eight gene expression datasets, i.e., Leukemia, Colon, Srbct, DLBCL, Lungcancer, Breastcancer, TumorC and Prostate, have been used. Furthermore, the results of the proposed method are compared with four other well known feature selection methods: significant features "sigF" [9], minimum redundancy maximum relevance (mRmR) [37], the Wilcoxon rank sum test "Wilc" [38] and an ensemble method called SVM-mRMRe [39]. The comparison shows that the proposed method (WSNR) stands apart from the aforementioned methods in terms of classification error. Box-plots and bar-plots of the results are also constructed, and they likewise indicate that the proposed method performs better than the other feature selection methods. The rest of the paper is organized as follows. Section 2 gives a detailed description of the datasets used in the paper, the support vector machine (SVM) classifier, the feature selection procedures "Significant Features" (sigF) [9] and Signal to Noise Ratio (SNR) [40], and the proposed method (WSNR) with its mathematical background and algorithm. Section 3 presents the experimental setup of the proposed method. Section 4 discusses the results of the proposed method, WSNR. The paper is concluded in Section 5.

Data sets
For the assessment of the proposed method, WSNR, eight benchmark problems are used. Their sources, along with the number of features, the number of observations and the class-wise distribution of samples, are given in Table 1.

Support vector machine
Support vector machine (SVM) is a supervised learning technique that has been widely used for regression and classification problems in the literature. It has also been used for feature selection in several studies [32,33,48]. The classifier can use several kernel functions to perform classification effectively in linear and non-linear feature spaces. The SVM searches for a linear or non-linear optimal hyperplane (H) that divides the two groups of observations meaningfully [49]. This hyperplane is supposed to be at maximum distance from both classes in the high-dimensional space, so as to separate the two groups as much as possible. The hyperplane is represented by the vector form given in Eq 1, which acts as a reference frame for identifying the position of each sample or observation in the high-dimensional space. The weighted inputs are summed to produce a score known as the discriminant score, which is then used to assign each observation to one of the two classes.
$$ y = w^{\top} z + b = \sum_{j=1}^{d} w_j z_j + b \tag{1} $$
where y is the response, i.e., $y \in \{0, 1\}$, so that each sample in the data is classified into class 0 or 1, $z = (z_1, \ldots, z_d)$ is a d-dimensional input vector, and the vector $w = (w_1, \ldots, w_d)$ contains the coefficients of the hyperplane. The term b is the intercept of the hyperplane.
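As a small worked illustration of the discriminant score, the following Python sketch sums the weighted inputs and adds the intercept. The function names, the numerical values, and the rule that a non-negative score maps to class 1 are assumptions made for illustration only; they are not taken from the study.

```python
def discriminant_score(w, z, b):
    """Return the linear score w . z + b for one sample z."""
    if len(w) != len(z):
        raise ValueError("w and z must have the same length")
    return sum(wj * zj for wj, zj in zip(w, z)) + b


def predict_class(w, z, b):
    """Assumed decision rule: non-negative score -> class 1, otherwise class 0."""
    return 1 if discriminant_score(w, z, b) >= 0 else 0


# Hypothetical hyperplane coefficients and intercept for d = 3 features.
w = [0.5, -1.0, 2.0]
b = -0.5

z_a = [1.0, 0.5, 1.0]   # score = 0.5 - 0.5 + 2.0 - 0.5 = 1.5
z_b = [0.0, 1.0, 0.0]   # score = -1.0 - 0.5 = -1.5

score_a = discriminant_score(w, z_a, b)
score_b = discriminant_score(w, z_b, b)
```

The sign of the score tells on which side of the hyperplane a sample lies, and its magnitude grows with the distance from the hyperplane.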

Mathematical description behind SVM weights w.
The SVM algorithm uses a hyperplane (H) to classify the data points into their respective classes, i.e.,
$$ H: \; w^{\top} \psi(z) + b = 0. $$
The distance between a given point $\psi(z_0)$ and the hyperplane H is given by
$$ d_H(\psi(z_0)) = \frac{|w^{\top} \psi(z_0) + b|}{\|w\|_2} \tag{2} $$
where $\|w\|_2$ is the Euclidean norm, given as
$$ \|w\|_2 = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}. $$
The weight vector is the argument that maximizes the distance given in Eq 2 for the observations closest to the hyperplane, that is:
$$ w^{*} = \arg\max_{w} \left[ \min_{i} d_H(\psi(z_i)) \right]. $$
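The distance of a point from the hyperplane, and the margin as the smallest such distance over a set of points, can be checked numerically with a short Python sketch. It assumes a linear kernel, so the feature map ψ is taken to be the identity; the function names and the example values are hypothetical.

```python
import math


def euclidean_norm(w):
    """Euclidean norm of the weight vector."""
    return math.sqrt(sum(wj * wj for wj in w))


def distance_to_hyperplane(w, b, point):
    """Distance |w . point + b| / ||w|| of one point from the hyperplane."""
    norm = euclidean_norm(w)
    if norm == 0:
        raise ValueError("the weight vector must be non-zero")
    score = sum(wj * pj for wj, pj in zip(w, point)) + b
    return abs(score) / norm


def margin(w, b, points):
    """Smallest distance from the hyperplane over all given points."""
    return min(distance_to_hyperplane(w, b, p) for p in points)


# Hypothetical hyperplane 3*z1 + 4*z2 - 5 = 0, so ||w|| = 5.
w = [3.0, 4.0]
b = -5.0
points = [[3.0, 4.0], [1.0, 2.0]]

d_first = distance_to_hyperplane(w, b, points[0])   # |9 + 16 - 5| / 5 = 4.0
d_second = distance_to_hyperplane(w, b, points[1])  # |3 + 8 - 5| / 5 = 1.2
smallest = margin(w, b, points)
```

An SVM solver searches over w and b for the hyperplane whose smallest distance is as large as possible; the sketch only evaluates that quantity for one fixed hyperplane.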

Significant feature selection (sigF)
A method known as significant feature selection (sigF) can be found in [9]. In this method, significant features are identified with the help of a support vector machine and the t-test. First, the weight of each feature is computed via the support vector machine (SVM). In the second stage, the t-statistic is computed for each feature in the data in the following manner:
$$ t_j = \frac{\bar{z}_{0j} - \bar{z}_{1j}}{\sqrt{\dfrac{s_{0j}^2}{n_0} + \dfrac{s_{1j}^2}{n_1}}} $$
where $\bar{z}_{0j}, \bar{z}_{1j}$, $s_{0j}, s_{1j}$ and $n_0, n_1$ represent the means, standard deviations and numbers of samples in classes 0 and 1, respectively. In this way the t-statistic is computed for each feature in the data, and the corresponding p-values for all the features are then computed from the t-test. A smaller p-value indicates greater discriminative ability of a feature. The weights computed via the SVM classifier are then multiplied with these p-values to obtain new weights for all the features by using the following equation.
where v is the level of significance for the corresponding reference distribution and u is the observed value of test statistic based on the level of flexibility v. The feature is considered informative if it possesses a smaller value of ξ.
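The per-feature t-statistic used in the second stage of sigF can be illustrated with a short Python sketch. It assumes the unequal-variance (Welch) form of the two-sample statistic and sample standard deviations; the function name and the toy values are hypothetical, and the conversion to a p-value is left out because it needs the t-distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, stdev


def t_statistic(values0, values1):
    """Two-sample t-statistic of one feature (assumed unequal-variance form).

    values0 and values1 hold the expression values of the feature
    in class 0 and class 1, respectively.
    """
    n0, n1 = len(values0), len(values1)
    s0, s1 = stdev(values0), stdev(values1)   # sample standard deviations
    standard_error = math.sqrt(s0 ** 2 / n0 + s1 ** 2 / n1)
    if standard_error == 0:
        raise ValueError("both classes are constant for this feature")
    return (mean(values0) - mean(values1)) / standard_error


# Toy feature: class means 2 and 5, both standard deviations equal to 1.
t = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # -3 / sqrt(2/3)
```

A feature whose class means are far apart relative to its within-class spread gets a t-statistic of large magnitude, and hence a small p-value.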

The proposed method, WSNR
The proposed method selects the informative genes or features in high-dimensional gene expression data sets in a fashion similar to that of sigF given in [9]. The only difference is that the method in [9] computes the t-statistic for each feature, which is then multiplied with the weights computed via the support vector machine classifier. The proposed method, on the other hand, computes the signal to noise ratio [40] for each feature in the following manner.
$$ SNR_j = \frac{\bar{z}_{0j} - \bar{z}_{1j}}{s_{0j} + s_{1j}} $$
where $\bar{z}_{0j}, \bar{z}_{1j}$ and $s_{0j}, s_{1j}$ represent the means and standard deviations of the j-th feature in classes 0 and 1, respectively. Features that carry a larger value of SNR are supposed to have greater discriminative ability. Similarly, the weights of all the features in the data, i.e., $w_j$, are computed via SVM. Since both weighting schemes assign larger weights to the informative genes, their product also assigns larger weights to the features that are informative. The resultant weights of the proposed method (WSNR) are computed by using the following equation
$$ (W_{SNR})_j = w_j \times SNR_j $$
where $(W_{SNR})_j$ represents the weight of the j-th feature in the data. The proposed method (WSNR) takes the following steps in identifying the informative genes.
• Compute weights of all the features using support vectors and denote it by w j .
• Compute signal to noise ratio for all the features in the training data and denote it by SNR j .
• Multiply the corresponding weights in step 1 and 2 and arrange them in descending order.
• Select the top ranked (K) genes in step 3 for the model construction.
The authors in [9] have used the t-test rather than the signal to noise ratio for the selection of discriminative genes. The t-test requires the underlying distribution of the variables to be approximately normal, an assumption that is difficult to verify when the data contain tens of thousands of genes or variables. The signal to noise ratio, on the other hand, does not require such an assumption. The pseudo code given in Algorithm 1 explains how the proposed method, WSNR, identifies the informative genes in high-dimensional gene expression data sets, followed by its flowchart in Fig 1.
Algorithm 1 Pseudo code of the proposed method, WSNR.
...
10: SNR_j: compute using the signal to noise ratio;
11: Perform (W_SNR)_j = w_j * SNR_j;
12: end for
13: Arrange the weights (W_SNR)_j in decreasing order;
14: Select the top K genes for model construction.
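The ranking steps of the proposed method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the SVM feature weights are taken as a precomputed input (the values below are hypothetical), sample standard deviations are used, and absolute values of both factors are taken before multiplying so that the sign of a weight does not affect the ranking. The absolute values are an assumption of this sketch; the text only states that the two weights are multiplied.

```python
from statistics import mean, stdev


def snr(values0, values1):
    """Signal to noise ratio of one feature: mean difference over summed sds."""
    spread = stdev(values0) + stdev(values1)
    if spread == 0:
        raise ValueError("feature is constant within both classes")
    return (mean(values0) - mean(values1)) / spread


def wsnr_rank(X, y, svm_weights, k):
    """Rank features by |w_j| * |SNR_j| and return the top-k indices and all scores.

    X is a list of samples (rows), y holds class labels 0/1, and
    svm_weights holds one precomputed SVM weight per feature.
    """
    d = len(X[0])
    scores = []
    for j in range(d):
        class0 = [row[j] for row, label in zip(X, y) if label == 0]
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        # Assumption of this sketch: absolute values are used for ranking.
        scores.append(abs(svm_weights[j]) * abs(snr(class0, class1)))
    order = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return order[:k], scores


# Toy training data: 6 samples, 3 features.
X = [[1, 5, 2], [2, 6, 4], [3, 7, 6],
     [7, 5, 4], [8, 7, 6], [9, 6, 8]]
y = [0, 0, 0, 1, 1, 1]
svm_weights = [0.8, 0.1, -1.2]   # hypothetical weights from a linear SVM

top_genes, scores = wsnr_rank(X, y, svm_weights, k=2)
```

In this toy example feature 0 has the largest class separation, feature 1 has identical class means, and feature 2 has a large SVM weight but a modest SNR, so the product ranks them 0, 2, 1.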

Experiments
This section describes the experimental setup of the current paper. Eight high-dimensional gene expression benchmark problems are analyzed, where each benchmark problem is split into training (70%) and testing (30%) parts. This splitting is repeated 500 times for all feature selection procedures and for the classifiers used to assess their performance. Random forest (RF) and k-nearest neighbours (k-NN) classifiers have been used to evaluate the performance of the different subsets of informative genes selected by the various feature selection techniques.
The feature selection method, minimum redundancy and maximum relevance (mRmR), is implemented in the R package mRMRe [50]. The Wilcoxon rank sum test (Wilc) and significant feature selection (sigF) are implemented by using the R packages WilcoxCV [51] and sigFeature [9], respectively. Moreover, the R library randomForest [52] is utilized for fitting the random forest algorithm with default parameters, i.e., ntree = 500, mtry = $\sqrt{p}$ and nodesize = 1, where p is the number of features. Similarly, the R library caret [53] is used for the implementation of the k-nearest neighbours classifier, with parameter k = 5.
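The evaluation protocol described above (repeated 70/30 splits, a k-NN classifier with k = 5, and the classification error rate as the metric) can be sketched in plain Python. This is only an illustrative stand-in for the R packages used in the study: all function names are hypothetical, Euclidean distance and majority voting are assumed for k-NN, and the feature selection step that would run on each training part is omitted for brevity.

```python
import math
import random
from collections import Counter


def split_indices(n, train_fraction, rng):
    """Randomly split the indices 0..n-1 into training and testing parts."""
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training samples nearest to the query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(y_true, y_pred):
    """Fraction of misclassified test samples."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def repeated_holdout_error(X, y, k=5, repeats=500, train_fraction=0.7, seed=0):
    """Average test error of k-NN over repeated random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        train_idx, test_idx = split_indices(len(X), train_fraction, rng)
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # A gene selection step would be applied to the training part here.
        predictions = [knn_predict(train_X, train_y, X[i], k) for i in test_idx]
        errors.append(error_rate([y[i] for i in test_idx], predictions))
    return sum(errors) / len(errors)


# Toy data: two well separated classes of 10 samples each.
X = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10
average_error = repeated_holdout_error(X, y, k=5, repeats=20)
```

Selecting genes inside the loop, on the training part only, keeps the test samples from influencing which genes are chosen.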
The training part of each benchmark problem is used by the different gene selection procedures to select subsets of discriminative genes of size K = 5, 10 and 15, on which the classifiers are trained. The classification error rate is used as the performance metric to investigate the classifiers' performance on the selected sets of informative genes. Table 2 provides the classification error rates produced by the proposed method, WSNR, and all the other competitors included in the study, for the different subsets of informative genes. From Table 2, it is evident that for the "Leukemia" data set the proposed method has outperformed all the other methods on both classifiers. For the "Colon" data set, the proposed method has outperformed the others on the random forest classifier for all subsets of discriminative genes, while on the k-nearest neighbour classifier the method sigF has produced the minimum error for the subset of 5 informative genes; the proposed method, however, has produced the minimum error rates for the subsets of 10 and 15 genes. Similarly, for the "Lungcancer" data set, the method Wilc has yielded the minimum error rates on the random forest classifier, while the proposed method has outperformed all the other competitors on the k-NN classifier. For the "Srbct" data set, the proposed WSNR method has outperformed all the other methods except for the subset of 5 informative genes, where the method sigF has yielded the minimum error rate on the k-NN classifier. For the "DLBCL" data set, the proposed method has outperformed all the other methods on the random forest classifier but has shown poor performance on the k-NN classifier. Similarly, the WSNR method has outperformed all the other procedures in the majority of cases for the "Breast" data set but has shown poor performance on the "TumorC" data set. Finally, the proposed method has outperformed all the other methods on the "Prostate" data set.
Overall, the method, WSNR, has produced the minimum error rates on six out of eight data sets and comparable results on one data set. To summarize the results, a win-loss summary is given in Table 3.

Results and discussion
The performance of the proposed method is also illustrated with bar-plots of the results, given in Figs 2-9. It is clear from the plots that for the "Leukemia" data set the bars corresponding to the proposed method, WSNR, are smaller than the bars corresponding to all the other procedures included in the study. For the "Lungcancer" data set, the method "Wilc" produces lower error rates than the rest of the gene selection procedures. For the "Srbct" and "DLBCL" data sets, the WSNR method has produced the minimum classification error rates. For the remaining data sets, our method has maintained a majority winning position except for the "TumorC" data set. The bar-plots have been constructed for a quick insight into the results of the various feature selection methods included in the study. Similarly, box-plots of the results produced by the method, WSNR, and all the other competitors for 10 informative genes on the random forest classifier are also constructed.

PLOS ONE

Simulation
This subsection describes two simulation scenarios for the proposed method. The first scenario ($S_1$) is designed to mimic a situation where the proposed method is useful, whereas the second scenario ($S_2$) represents a data generation environment that might not favour the proposed method. For this purpose, two different models are designed, one for each scenario. The class probabilities of the Bernoulli response Y ~ Bernoulli(p), given an n × d dimensional matrix X of n iid observations from the Normal(0, 1) and Uniform(0, 1) distributions, are generated in each scenario by using the following equation.
The values of a and b are both fixed at 1.5. A vector of coefficients, β, is generated from the Uniform(−5, 5) distribution to fit the following linear predictor.
The top five (K = 5) important variables are identified from the above model based on their coefficients β. In order to contaminate the data, outliers from the Normal(20, 60) distribution are added to these top five variables. In addition, 20 noisy variables/observations from the Normal(5, 10) distribution are also added to the data. In this way, simulated data with n = 100 observations and d = 120 variables are generated. For all the methods considered, the same experimental setup is used as for the benchmark data sets. The second model is constructed in a similar fashion. The difference between the two models is that the former contains outliers and noisy variables/observations, while the latter does not. A total of 500 realizations are made for estimating the values of the performance metrics. The results of the simulation study for both scenarios are presented in Table 4.
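The contamination step of the first scenario can be sketched in Python as follows. Several details are assumptions of this sketch rather than facts from the text: the base matrix is taken to have 100 standard normal variables (so that adding 20 noisy variables gives d = 120), "important" is read as largest absolute coefficient, the second parameter of each Normal distribution is treated as a standard deviation, and outliers overwrite a hypothetical 10% of the rows of each important variable. The generation of the response is omitted because the link equation is not reproduced here.

```python
import random


def simulate_design(n=100, d_base=100, n_noise=20, k=5,
                    outlier_fraction=0.1, seed=1):
    """Build a contaminated design matrix for a scenario with outliers and noise."""
    rng = random.Random(seed)
    # Base predictors: n rows of d_base standard normal values (assumed layout).
    X = [[rng.gauss(0.0, 1.0) for _ in range(d_base)] for _ in range(n)]
    # Coefficients of the linear predictor, drawn from Uniform(-5, 5).
    beta = [rng.uniform(-5.0, 5.0) for _ in range(d_base)]
    # Assumption: importance is measured by the absolute coefficient.
    top = sorted(range(d_base), key=lambda j: abs(beta[j]), reverse=True)[:k]
    # Assumption: outliers replace a random fraction of rows in each top variable.
    n_outliers = int(outlier_fraction * n)
    for j in top:
        for i in rng.sample(range(n), n_outliers):
            X[i][j] = rng.gauss(20.0, 60.0)
    # Append the noisy variables, drawn from Normal(5, 10).
    for row in X:
        row.extend(rng.gauss(5.0, 10.0) for _ in range(n_noise))
    return X, beta, top


X_sim, beta_sim, top_sim = simulate_design()
```

Setting `outlier_fraction=0` and `n_noise=0` gives an uncontaminated design of the kind described for the second scenario, again under the assumptions listed above.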

From Table 4, it is evident that when there are noisy variables/observations in the data, the proposed method, WSNR, performs better than the other competitors, whereas the method Wilc produces the minimum error rates when there are no noisy variables/observations in the data. Similarly, bar-plots of the error rates for different subsets of genes, when the simulated data contain noisy genes/observations and when they do not, are given in Figs 18 and 19, respectively. The plots indicate that the proposed method, WSNR, produces the minimum error rates in the presence of noisy features/observations in the data.

Conclusion
The current study has proposed a novel feature selection method that exploits feature weighting via support vectors and the signal to noise ratio (SNR). The proposed method first computes the weights of all genes using a support vector machine, followed by the computation of the signal to noise ratio for all the genes in the training phase. These two sets of weights are then multiplied to obtain a new weight for each gene in the data. Genes are arranged in decreasing order of their weights, and the top ranked genes are selected for model construction. The proposed method is validated on eight benchmark problems and compared with four well known feature selection methods in terms of classification error rates. Two state-of-the-art classifiers, i.e., random forest (RF) and k-NN, are used to evaluate the performance of the genes selected by the various feature selection methods. The analyses revealed that the proposed method, WSNR, has outperformed all the other methods on 6 out of 8 data sets and has produced comparable results on 2 data sets. For quick insight into the results of the proposed method and all the other methods, bar-plots and box-plots of the results have also been constructed. Furthermore, the proposed method is also evaluated on simulated data under two scenarios: first, a scenario that favors the proposed idea, where the data contain noisy features and outlier observations; second, a scenario that does not favor the proposed method, where there are no noisy features or outlier observations in the data. From all the analyses, it is concluded that the proposed method could effectively be used in high dimensional settings where the underlying distribution of the observations is not known, as is the case with micro-array data.
For future work in the direction of the proposed study, one can extend it to the situation of unsupervised learning, where the features will first be divided into clusters, and then the proposed method applied in each cluster. The top ranked genes in each cluster can then be selected for the model construction. One can also use the robust measures of location and dispersion in conventional signal to noise ratio to mitigate the effect of outliers in gene expression values. In addition, the performance of the proposed method can be checked by using various kernel functions in SVM.